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Abstract 

Background: Microarray technology is widely used in cancer diagnosis. Successfully identifying gene biomarkers 
will significantly help to classify different cancer types and improve the prediction accuracy. The regularization 
approach is one of the effective methods for gene selection in microarray data, which generally contain a large 
number of genes and have a small number of samples. In recent years, various approaches have been developed 
for gene selection of microarray data. Generally, they are divided into three categories: filter, wrapper and 
embedded methods. Regularization methods are an important embedded technique and perform both continuous 
shrinkage and automatic gene selection simultaneously. Recently, there is growing interest in applying the 
regularization techniques in gene selection. The popular regularization technique is Lasso (L^, and many l^ type 
regularization terms have been proposed in the recent years. Theoretically, the lq type regularization with the 
lower value of q would lead to better solutions with more sparsity. Moreover, the L 1/2 regularization can be taken 
as a representative of lq (0 < q < 1) regularizations and has been demonstrated many attractive properties. 

Results: In this work, we investigate a sparse logistic regression with the L 1/2 penalty for gene selection in cancer 
classification problems, and propose a coordinate descent algorithm with a new univariate half thresholding operator 
to solve the L 1/2 penalized logistic regression. Experimental results on artificial and microarray data demonstrate the 
effectiveness of our proposed approach compared with other regularization methods. Especially, for 4 publicly available 
gene expression datasets, the L 1/2 regularization method achieved its success using only about 2 to 14 predictors 
(genes), compared to about 6 to 38 genes for ordinary l } and elastic net regularization approaches. 

Conclusions: From our evaluations, it is clear that the sparse logistic regression with the L 1/2 penalty achieves higher 
classification accuracy than those of ordinary l^ and elastic net regularization approaches, while fewer but informative 
genes are selected. This is an important consideration for screening and diagnostic applications, where the goal is 
often to develop an accurate test using as few features as possible in order to control cost. Therefore, the sparse 
logistic regression with the L 1/2 penalty is effective technique for gene selection in real classification problems. 
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Background 

With the development of DNA microarray technology, the 
biology researchers can analyze the expression levels of 
thousands of genes simultaneously. Many studies have 
demonstrated that microarray data are useful for classifica- 
tion of many cancers. However, from the biological per- 
spective, only a small subset of genes is strongly indicative 
of a targeted disease, and most genes are irrelevant to can- 
cer classification. The irrelevant genes may introduce noise 
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and decrease classification accuracy. Moreover, from the 
machine learning perspective, too many genes may lead to 
overfitting and can negatively influence the classification 
performance. Due to the significance of these problems, ef- 
fective gene selection methods are desirable to help to clas- 
sify different cancer types and improve prediction accuracy. 

In recent years, various approaches have been developed 
for gene selection of microarray data. Generally, they are di- 
vided into three categories: filter, wrapper and embedded 
methods. Filter methods evaluate a gene based on discrim- 
inative power without considering its correlations with other 
genes [1-4]. The drawback of filter methods is that it exam- 
ines each gene independently, ignoring the possibility that 
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groups of genes may have a combined effect which is not ne- 
cessarily reflected by the individual performance of genes in 
the group. This is a common issue with statistical methods 
such as T-test, which examine each gene in isolation. 

Wrapper methods utilize a particular learning method 
as feature evaluation measurement to select the gene 
subsets in terms of the estimated classification errors 
and build the final classifier. Wrapper approaches can 
obtain a small subset of relevant genes and can signifi- 
cantly improve classification accuracy [5,6]. For example, 
Guyon et al. [7] proposed a gene selection approach util- 
izing support vector machines (SVM) based on recursive 
feature elimination. However, the wrapper methods 
greatly require extensive computational time. 

The third group of gene selection procedures is embed- 
ded methods, which perform the variable selection as 
part of the statistical learning procedure. They are much 
more efficient computationally than wrapper methods 
with similar performance. Embedded methods have 
drawn much attention recently in the literature. The em- 
bedded methods are less computationally expensive and 
less prone to over fitting than the wrapper methods [8]. 

Regularization methods are an important embedded tech- 
nique and perform both continuous shrinkage and auto- 
matic gene selection simultaneously. Recently, there is 
growing interest in applying the regularization techniques in 
the logistic regression models. Logistic regression is a power- 
ful discriminative method and has a direct probabilistic inter- 
pretation which can obtain probabilities of classification 
apart from the class label information. In order to extract 
key features in classification problems, a series of regularized 
logistic regression methods have been proposed. For ex- 
ample, Shevade and Keerthi [9] proposed the sparse logistic 
regression based on the Lasso regularization [10] and Gauss- 
Seidel methods. Glmnet is the general approach for the L x 
type regularized (including Lasso and elastic net) linear 
model using a coordinate descent algorithm [11,12]. Similar 
to sparse logistic regression with the L x regularization 
method, Gavin C. C. and Nicola L. C. [13] investigated 
sparse logistic regression with Bayesian regularization. In- 
spired by the aforementioned methods, we investigate the 
sparse logistic regression model with a L 1/2 penalty, in par- 
ticular for gene selection in cancer classification. The L 1/2 
penalty can be taken as a representative of Lq (0 < q < 1) 
penalty and has demonstrated many attractive properties, 
such as unbiasedness, sparsity and oracle properties [14]. 

In this paper, we develop a coordinate descent algorithm 
to the L 1/2 regularization in the sparse logistic regression 
framework. The approach is applicable to biological data 
with high dimensions and low sample sizes. Empirical 
comparisons with sparse logistic regressions with the L 1 
penalty and the elastic net penalty demonstrate the effect- 
iveness of the proposed L 1/2 penalized logistic regression 
for gene selection in cancer classification problems. 



Methods 

Sparse logistic regression with the L 1/2 penalty 

In this paper, we focus on a general binary classification 
problem. Suppose we have n samples, D = {(X lf yi), (X 2 , 
y 2 ), (X w y n )}, where X i = {x il ,x i2 x ip ) is i th input 
pattern with dimensionality p and y t is a corresponding 
variable that takes a value of 0 or 1; y t = 0 indicates the 
z th sample in Class 1 and y t = 1 indicates the i th sample 
is in Class 2. The vector X t contains p features (for all 
p genes) for the I th sample and Xy denotes the value of 
gene / for the i th sample. Define a classifier^) = e x /(l + 
e x ) such that for any input x with class label y, fix) pre- 
dicts y correctly. The logistic regression is expressed as: 



P(Y, = l\Xi) =f{Xfi) 



exp(X^) 
1 + exp(X«yS) 



(1) 



Where = (fi 0 , fix,..., fi p ) are the coefficients to be esti- 
mated, note that fi 0 is the intercept. The log-likelihood is: 

l(fi\D) = -£ {y t log \f{Xfi)} + (1-y,) log[l-/(X ',/?)] } 

i=l 

(2) 

We can obtain ft by minimizing the log-likelihood (2). 
In high dimensional application with p » n, directly 
solving the logistic model (2) is ill-posed and may lead 
to overfitting. Therefore, the regularization approaches 
are applied to address the overfitting problem. When 
adding a regularization term to (2), the sparse logistic re- 
gression can be modelled as: 



0 = argminl l{fi\D) + aJ>(^) 



(3) 



Where A > 0 is a tuning parameter and P(B) is a 
regularization term. The popular regularization tech- 
nique is Lasso (Li) [10], which has the regularization 
term P(/?) = E|/?|. Many L x type regularization terms 
have been proposed in the recent years, such as SCAD 
[15], elastic net [16], and MC+ [17]. 

Theoretically, the Lq type regularization P(J3) = Z\/3\ q 
with the lower value of q would lead to better solutions 
with more sparsity. However when q is very close to 
zero, difficulties with convergence arise. Therefore, Xu 
et al. [14] further explored the properties of Lq (0 <q <1) 
regularization and revealed the extreme importance and 
special role of the L 1/2 regularization. They proposed 
that when l/2< q <1, the L 1/2 regularization can yield 
most sparse results and its difficulty with convergence is 
not very high compared with that of the L 2 
regularization, while when 0< q <l/2, the performance 
of Lq penalties makes no significant difference and solv- 
ing the L 1/2 regularization is much simpler than solving 
the L 0 regularization. Hence, the L 1/2 regularization can 
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be taken as a representative of Lq (0 < q < 1) 
regularizations. In this paper, we apply the L 1/2 penalty 
to the logistic regression model. The sparse logistic re- 
gression model based on the L 1/2 penalty has the form: 



0 1/2 = argminl Z0|D) +AjV/ 



/2 



(4) 



7=1 



The Lx/2 regularization has been demonstrated many at- 
tractive properties, such as unbiasedness, sparsity and oracle 
properties. The theoretical and experimental analyses show 
that the L 1/2 regularization is a competitive approach. Our 
work in this paper also reveals the effectiveness of the Li /2 
regularization to solve the nonlinear logistic regression prob- 
lems with a small number of predictive features (genes). 

A coordinate descent algorithm for the L 1/2 penalized 
logistic regression 

The coordinate descent algorithm [11,12] is a "one-at-a- 
time" approach, and its basic procedure can be described as 
follows: for each coefficients, to partially optimize the target 
function with respect to fifj - 1, 2, j?) with the remaining 
elements of fixed at their most recently updated values. 

Before introducing the coordinate descent algorithm 
for the nonlinear logistic regularization, we first consider 
a linear regularization case. Suppose the dataset D has n 
samples, D = {(X lf ji), (X 2 , j/ 2 ), pO?,y w )}, where X t = 
(xn> Xi P ) is i th input variables with dimensionality 

p and ji is the corresponding response variable. The 



variables are standardized: 



and Yj yi 



0. 



i=l 



Therefore, The linear regression with the regularization 
term can be expressed as: 



R(fi) = argmin < 



(5) 



Where P(B) is the regularization term. The coordinate 
descent algorithm solves y6y and other fi k ^ ; (k * j repre- 
sent the parameters remained after / h element is re- 
moved) are fixed. The equation (5) can be rewritten as: 



^ V k*j ) k*j 



(6) 



The first order derivative at can be estimated as: 



Define yf = ^XjkPk as tne P ar ti a l residual for fitting 

n 

and G)j = ^ ^{yj-y^j > the univariate soft thresholding 

i=l 

operator of the coordinate descent algorithm [11] for the 
regularization (Lasso) can be defined as: 



{coj + A ifcoj < -A 
Gjy-A ifcoj > A 
0 if\coj\ < A 



(8) 



Similarly, for the L 0 regularization, the thresholding oper- 
ator of the coordinate descent algorithm can be defined as: 



ft- = Hard {coj , A) = col ( | coj | > A) 



(9) 



where / is the indicator function. This formula is equivalent 
to the hard thresholding operator [17]. 

According to equations (8) and (9), we can know that the 
different penalties are associated with different thresholding 
operators. Therefore, Xu et al. [18] proposed a half 
thresholding operator to solve the L 1/2 regularization for 
linear regression model. It is an iterative algorithm and can 
be seen as multivariate half thresholding approach. In this 
paper, we propose the univariate half thresholding operator 
of the coordinate descent algorithm for the L 1/2 
regularization. Based on equation (7), the gradient of the 
L 1/2 regularization at can be expressed as: 



dR - B a I A 

— p raj+ x—=^ 



Firstly, we consider the > 0 statement, and let, y 
fij = When /5 y > 0, the equation (10) can be redefined as: 
A 



(10) 



[4 r-odjft + - 



0 



(11) 

0 < CO) < \t\ 



There are three cases of coj < 0, 

and coj > |A^ respectively. 

(i) If CO) < 0, the three roots of equation (11) can be 
expressed as follows: 

[t l = -2 r sin | , [i 2 = r sin | + iy/3 r cos | and 



03 



rsin |-r \f?Tr cos |, 



where r = y 3 , 0 = arccos^). When r > 0, none of 
the roots satisfices fii > 0. Thus, there is no solution to 
equation (11) when coj < 0. 

(ii) If 0 < coj < I A^, the three roots of equation (11) are: 



01 



-2 r cos 



02 



r cos 



2 \/3^ sin I and 



f* 3 = r cos I -r \/3^ sin |. 



Liang et al. BMC Bioinformatics 2013, 14:198 
http://www.biomedcentral.com/1471 -21 05/1 4/1 98 



Page 4 of 12 



There is still no solution to equation (11) in this case, 
(iii) If o)j > |A 5 , the three roots of equation (11) are 
given by: 



•fi 1 = -2 r cos — , ^ 



2r cos 



and 



2r cos 



In this case, the fi 2 is a unique solution of equation (10). 
Thus, the equation (11) has non-zero roots only when o)j 
> | A 5 . The unique solution of equation (10) is as follow: 



2 (71 



0 (*>;)) 



On the other hand, in the < 0 statement, we de- 
/? ; = (A and /? ; = - (A 1 . The equation (10) can be 



noted 

transformed into the equation: 



(12) 



The equation (12) also has a unique solution 



when o)j < -f A 3 : 



1*2 = 2r cos (^y) and ft. 



1 



cos 



2(n-<p((0j)) 



In conclusion, the univariate half thresholding oper- 
ator can be expressed as: 

Pj=HaF(aj,\) 



0 



otherwise 



(13) 



where 0a (^) satisfies: 



cos(0 A (w)) 



8 V 3 



The coordinate descent algorithm for the L 1/2 
regularization makes repeated use of the univariate half 
thresholding operator. The details of the algorithm will 
be described later. This coordinate descent algorithm for 
the regularization can be extended to the sparse logistic 
regression model. Based on the objective function (3) of 
the sparse logistic regression, one-term Taylor series ex- 
pansion for 1(B) has the form of 



n P 
i=l j=l 



is an estima- 



Where Z, = X# -rj^^ 

ted response, W t =f(x i fj(l-f(x i fj > j is a weight 

and /(Xifty = exp(Xijty/(l + exp(x^ is a evalu- 
ated value at current parameters. Redefine the partial re- 



sidual for fitting current /? as zlf = jWi(Zi~Z 

n 

and ^ Xy ( Zj- Z ^ ) , we can directly apply the coordinate 

i=i 

descent algorithm with the L 1/2 penalty for sparse logistic re- 
gression and the details are given follows: 



Algorithm: The coordinate descent algorithm for sparse logistic with the L 1/2 penalty 
Step 1 : Initialize all (m) = 0 (J =1,2,. . .,p) and A, set m = 0; 

Step 2: Compute Z (m) and W(m) and approximate the loss function(14) based on the 
Current ft (m); 

Step 3: Update each fa (m), and cycle over j=l,...,p, until fij (m) does not change; 

n 

Step 3.1: Calculate z['\m) = Y W|(m)(Zi(m) - £ x ik p k (mj) 

n 

and (Oj(rn) = . ( z . (m ) _ Z- 0) (m)) ; 

i=l 

Step 3.2: Update #(/w) = Halfico/m), A); 
Step 4: Let m=m + l,0(m + 1) <- /?(m), 

repeat Steps 2, 3 until J (IftC™ + 1)1 -|ft(m)|) < 10 -8 . 
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The coordinate descent algorithm for the L 1/2 penal- 
ized logistic regression works well in the sparsity prob- 
lems, because the procedure does not need to change 
many irrelevant parameters and recalculate partial resid- 
uals for each update step. 

Results 

Analyses of simulated data 

In this section, we evaluate the performance of the 
sparse logistic regression with the L 1/2 penalty in simula- 
tion study. We generate high-dimensional and low sam- 
ple size data which contain many irrelevant features. 
Two methods are compared with our proposed ap- 
proach: Sparse logistic regression with the Elastic Net 
penalty (L EN ) and Sparse logistic regression with the 
Lasso penalty (l^). 

We generated the vectors Yio>Yii>>-->Yi P (* = in- 
dependently from the standard normal distribution 
and the predictor vector (z=l,...,#) is generated by 
Xij = Yijyf^-P + YioVP (/ =1 >—> P)> where p is the correl- 
ation coefficient of the predictor vectors [19]. The simu- 
lated data set generated from the logistic model: 




Where s is the independent random error generated 
from Af(0,l) and a is the parameter which controls the 
signal to noise. In every simulation, the dimension p of the 
predictor vector is 1000, and the first five true coefficients 
are nonzero: fa = 1, fa = h fa = -h fa = -h fa = h 
and fa = 0(6 <;< 1000). 

The estimation of the optimal tuning parameter A in 
the sparse logistic regression models can be done in 
many ways and is often done by /c-fold cross-validation 
(CV). Note that the choice of k will depend on the size 
of the training set. In our experiments, we use 10-fold 
cross-validation (/<=10). The elastic net method has two 
tuning parameters, we need to cross -validate on a two- 
dimensional surface [16]. 

We consider the cases with the training sample size 
n = 50, 80, 100, the correlation coefficient p =0.1, 0.4 
and the noise control parameter a =0.2, 0.6 respectively. 
Each classifier was evaluated on a test data set including 
100 samples. The experiments were repeated 30 times 
and we report the average test errors in Table 1. As 
shown in Table 1, when the sample size n increases, the 
prediction performances of all the three methods are im- 
proved. For example when p =0.1, and a =0.2, the aver- 
age test errors of the L 1/2 method are 28.2%, 10.7% and 
8.1% with the sample sizes n=50, 80, and 100 respect- 
ively. When the correlation parameter p and the noise 
parameter a increase, the prediction performances of all 
the three methods are decreased. For example, when 



Table 1 The average errors (%) for the test data sets 
obtained by the sparse logistic regressions with the 
l-i/2/ Len and Li penalties in 30 runs 



Sample size L-|/2 Len Li 



n 1 

p — U. I , 




ZO.Z 


31 Q 

5 I .o 


o I .z 


O = 0.2 


n=80 


10.7 


23.1 


22.2 




n=100 


8.1 


16.9 


15.7 


p = 0.1, 
o = 0.6 


n=50 


31.4 


33.1 


33.3 




n=80 


18.4 


27.1 


26.6 




n=100 


14.2 


22.4 


21.3 


p = 0.4, 
o = 0.2 


n=50 


30.1 


32.6 


33.0 




n=80 


11.1 


23.3 


22.9 




n=100 


9.1 


19.0 


16.4 


p = 0.4, 
o = 0.6 


n=50 


35.1 


35.5 


36.3 




n=80 


20.5 


27.2 


26.9 




n=100 


15.1 


22.7 


22.9 


p =0.4 and 


n =100, the average test errors from the L 1/2 


method increased from 9.1% to 15.1%, 


in which 


a in- 


creased from 0.2 to 0.6. When a =0.6 and n =80, the aver- 


age test error from the L 1/2 method increase from 18.4% 


to 20.5%, in which p increased from 0.1 to 0.4. Moreover, 


in our simulation, the influence of the noise may be larger 


than that of the variable correlation for the prediction per- 


formance of all the three methods. On the other hand, at 


the same parameter setting 


\ case, the prediction perform- 


ance of the L 1/2 method is 


consistent and better than the 


results of the L EN and L 2 


methods. For 


example, 


when 


p =0.1, a = 


0.2 and w=100, the predictive error of the L 1/2 


method is < 


3.1% much better than 16.9% and 15.7% \ 


2>°t by 


the L EN and L 2 methods respectively. 






Table 2 The average number of variables selected by the 


sparse logistic regressions with the L 1/2 , L| 


en and Lt 




penalties in 30 runs 










Sample size 


Ll/2 


Len 


Li 


p = 0.1, 


n=50 


7.5 


31.6 


27.1 


a = 0.2 


n=80 


8.8 


43.1 


40.3 




n=100 


8.9 


49.7 


45.7 


p = 0.1, 
a = 0.6 


n=50 


8.3 


33.6 


29.2 




n=80 


10.6 


45.7 


41.9 




n=100 


10.8 


54.4 


50.1 


p = 0.4, 
a = 0.2 


n=50 


7.8 


33.5 


28.3 




n=80 


8.9 


44.5 


41.8 




n=100 


9.0 


51.2 


46.6 


p = 0.4, 
a = 0.6 


n=50 


8.6 


41.3 


29.9 




n=80 


10.7 


45.9 


44.1 




n=100 


11.2 


56.4 


53.4 



Liang et al. BMC Bioinformatics 2013, 14:198 
http://www.biomedcentral.com/1471 -21 05/1 4/1 98 



Page 6 of 12 



Table 2 shows the average number of the variables 
selected in 30 runs for each method. Since the simu- 
lation datasets have relevant features, the ideal- 
ized average number of variables selected by each 
method is 5. In Table 2, the results obtained by the 
L 1/2 penalized method are obviously closed to 5 and 
3-10 times smaller than those of the L EN and L 2 
penalties at the same parameter setting. For example, 
when p =0.1, a =0.2 and «=100, the average numbers 
from the L EN and L 2 methods are 49.7 and 45.7 re- 
spectively, and the result of L 1/2 method is 8.9. 
Moreover, when the sample size n, the correlation 
parameter p, and the noise parameter a increase, 
the average numbers from all the three methods 
increase, but the values of the L EN and L 2 methods 
increase faster than those of the L 1/2 method. This 
means that the L 1/2 penalized method consistently 
outperforms than other two methods in term of 
variable selection. 

To further evaluate the performance of the L 1/2 pe- 
nalized method, we report the frequency with which 
each relevant variable was selected among 30 runs for 
each method in Table 3. When the sample size is 
small («=50), the L 1/2 penalty selects the relevant vari- 
ables slightly less frequently than the other two 
methods and all the three methods select true nonzero 
coefficients with difficulties, especially when p and o 
are relatively large. For example, when p =0.4, a =0.6, 
n=50, and for /? 5 , the selected frequencies of the L 1/2 , 
L EN and L 2 methods are 12, 14 and 13 respectively in 
30 runs. As n increases, all the three methods tend to 
select the true nonzero coefficients more accurately 
and the L 1/2 penalty method performs slightly better, 
in terms of variable frequencies, than the other two 
methods under the different parameter settings of p 
and o. To sum up, Tables 1, 2 and 3 clearly show that 
the L 1/2 method is winner among the competitors in 
terms of both prediction accuracy and variable selec- 
tion in the different variable correlation and noise 
situations. 

Analyses on microarray data 

In this section, we compare our proposed L 1/2 penalized 
method with the L EN and L 2 methods on 4 publicly 
available gene expression datasets: Leukaemia, Prostate, 
Colon and DLBCL. A brief description of these datasets 
is given below and summarized in Table 4. 



Table 3 The frequencies of the relevant variables 
obtained by the sparse logistic regressions with the 
l-i/2/ Len and L 1 penalties in 30 runs 

Sample size Method 



p = 0.1, 
o = 0.2 


n=50 


L1/2 
Len 


21 
24 


22 
25 


19 
21 


15 
17 


15 
17 






Li 


22 


24 


20 


15 


17 




n=80 


L1/2 


30 


30 


30 


30 


30 






Len 


30 


29 


30 


30 


30 






Li 


30 


29 


30 


30 


30 




n=100 


L1/2 


30 


30 


30 


30 


30 






Len 


30 


30 


30 


30 


30 


p = 0.1, 
o = 0.6 


n=50 


Li 

Ll/2 

Len 
Li 


30 
17 
18 
18 


30 
17 
19 
18 


30 
17 
17 
18 


30 
14 
16 
16 


30 
14 
14 
15 




n=80 


L1/2 


30 


29 


30 


28 


28 






Len 


30 


28 


30 


28 


27 






Li 


30 


28 


30 


27 


26 




n=100 


L1/2 


30 


30 


30 


30 


30 






Len 


30 


30 


30 


30 


30 


p = 0.4, 
o = 0.2 


n=50 


Li 

L1/2 
Len 


30 
19 
21 


30 
18 

22 


30 
18 
21 


28 
16 
17 


30 
15 
17 






Li 


18 


21 


19 


16 


17 




n=80 


Ll/2 


30 


30 


30 


30 


30 






Len 


30 


28 


30 


29 


29 






Li 


30 


27 


30 


29 


29 




n=100 


Ll/2 


30 


30 


30 


30 


30 






Len 


30 


30 


30 


30 


30 


p = 0.4, 
o = 0.6 


n=50 


Li 

Ll/2 

Len 


30 
14 
17 


30 
16 
17 


30 
15 
17 


29 
12 
12 


29 
12 
14 






Li 


17 


15 


14 


9 


13 




n=80 


Ll/2 


29 


25 


26 


28 


29 






Len 


28 


24 


24 


27 


24 






L, 


27 


24 


24 


23 


23 




n=100 


L,/2 


30 


29 


30 


30 


30 






Len 


30 


27 


28 


28 


30 






Li 


29 


27 


27 


28 


30 



Leukaemia dataset 

The original dataset was provided by Golub et al. [7], 
and contains the expression profiles of 7,129 genes for 
47 patients of acute lymphoblastic leukaemia (ALL) and 
25 patients of acute myeloid leukaemia (AML). For data 
preprocessing, we followed the protocol detailed in the 



supplementary information to Dudoit et al. [1]. After 
thresholding, filtering, applying a logarithmic transform- 
ation and standardizing each expression profile to zero 
mean and unit variance, a dataset comprising 3,571 
genes remained. 
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Table 4 Four publicly available gene expression datasets 
used in the experiments 



Dataset No. of genes No. of samples classes 



Leukaemia 


3571 


72 


ALL/AML 


Prostate 


5966 


102 


Normal^umor 


Colon 


2000 


62 


Normal^umor 


DLBCL 


6285 


77 


DLBCL/FL 



Prostate dataset 

This original dataset contains the expression profiles of 
12,600 genes for 50 normal tissues and 52 prostate 
tumor tissues. For data preprocessing, we adopt the pre- 
treatment method [20] to obtain a dataset with 102 sam- 
ples. And each sample contains 5966 genes. 

Colon dataset 

The colon microarray data set in Alon et al. [21] has 2000 
genes per sample and 62 samples which consist of 22 nor- 
mal tissues and 40 cancer tissues. The Colon dataset are 
available at http://microarray.princeton.edu/oncology. 

DLBCL dataset 

This dataset contains 77 microarray gene expression 
profiles of the 2 most prevalent adult lymphoid malig- 
nancies: 58 samples of diffuse large B-cell lymphomas 
(DLBCL) and 19 observations of follicular lymphoma 
(FL). Each sample contains 7,129 gene expression values. 
More information on these data can be found in Shipp 
MA et al. [22]. For data preprocessing, we followed the 
protocol detailed in the supplementary information to 
Dudoit et al. [1], and a dataset comprising 6,285 genes 
remained. 

We evaluate the prediction accuracy of the three pe- 
nalized logistic regression models using random parti- 
tion. This means that we divide the datasets at random 
such that approximate 70-80% of the datasets becomes 
training samples and the other 20-30% test samples. 
More information on these data is given in Table 5. For 
selecting the tuning parameter A, we employ the ten-fold 
cross validation scheme using the training set. We repeat 
this procedure 30 times and the averaged misclassifica- 
tion errors were reported in Table 6. Here the denomi- 
nators of the ten-fold cross validation errors and the test 
errors describe the sample size of training and test 

Table 5 The detail information of 4 microarray datasets 



used in the experiments 



Dataset 


No.of Training(class1/class2) 


No.of Testing(class1/class2) 


Leukaemia 


50(32 ALL/18 AML) 


22 (15 ALL/7 AML) 


Prostate 


71(35 ALL/36 AML) 


31(15 ALL/16 AML) 


Colon 


42(14 Normal/28 Tumor) 


20(8 Normal/12 Tumor) 


DLBCL 


60(45 DLBCL/15FL) 


17(13 DLBCL/4 FL) 



Table 6 The classification performances of different 



methods for 4 gene expression datasets 


Dataset 


Method 


Cross-validation 
error 


Test error 


No. of selected 
genes 


Leukaemia 


Ll/2 


2/50 


1/22 


2 




Len 


1/50 


1/22 


9 




Lt 


1/50 


1/22 


6 


Prostate 


Ll/2 


5/71 


3/31 


5 




Len 


5/71 


4/31 


34 




Lt 


5/71 


3/31 


25 


Colon 


Li/2 


4/42 


3/20 


5 




Len 


5/42 


4/20 


13 




Lt 


5/42 


4/20 


7 


DLBCL 


L 1/2 


3/60 


2/17 


14 




Len 


2/60 


1/17 


38 




Li 


3/60 


3/17 


23 



datasets respectively. The fractions of the ten-fold cross 
validation errors and the test errors and the number of 
gene selected are the approximated integers of the corre- 
sponding average number at 30 runs. As shown in 
Table 6, for Leukaemia dataset, the classifier with the L 1/2 
penalty gives the average ten-fold cross validation error 
of 2/50 and the average test error of 1/22 with about 2 
genes selected. The classifiers with L EN and L 2 methods 
give the average ten-fold cross validation errors of 1/50 
and the average test errors of 1/22 with about 9 and 6 
genes selected respectively. This means that all three 
methods can be successfully applied to high-dimensional 
classification problems and classify the Leukaemia 
dataset with same accuracies. Note that, the L 1/2 method 
achieved its success using only about 2 predictors 
(genes), compared to about 9 and 6 for the L EN and L 2 
methods. For Prostate and Colon datasets, it can be seen 
the L 1/2 method achieves the best classification perfor- 
mances with the highest accuracy rates using much 
fewer genes compared with those of the L EN and L 2 
methods. For DLBCL dataset, the L 1/2 logistic regression 
achieves better classification performance than that of 
the L 2 method and worse than that of the L EN method. 
However, as well as other three datasets, the L 1/2 
method achieved its success using much less predictors 
(about 14 genes), compared to about 38 and 23 for the 
L EN and L 2 methods. This is an important consideration 
for screening and diagnostic applications, where the goal 
is often to develop an accurate test using as few features 
as possible in order to control cost. 

Figures 1, 2 and 3 display the solution paths and the 
gene selection results of the three methods for the Pros- 
tate dataset in one sample run. Here the x-axis displays 
the number of running steps, the y-axis in the left sub- 
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Figure 1 The results of the sparse logistic regression with the L 1/2 penalty on Prostate dataset. The solution paths and the gene selection 
results of the sparse logistic L 1/2 penalty methods for the Prostate dataset in one sample run. 



figure is the coefficients measured gene importance and 
the y-axis in the right sub-figure is the misclassification 
errors based on the ten-fold cross validation. The opti- 
mal results of three methods are shown as vertical dot- 
ted lines. Figure 1 indicates that the number of nonzero 
coefficients (selected genes) of the optimal results 
obtained by the L 1/2 method is 5. In contrast, Figures 2 
and 3 indicate that the numbers of nonzero coefficients 
(selected genes) of optimal results obtained by the L EN 
and Li methods are 37 and 26 respectively. Generally 
speaking, the penalized logistic regression methods can 
be successfully applied to the cancer classification prob- 
lems with high dimensional and low samples microarray 
data, and our proposed L 1/2 method achieves better per- 
formance especially in gene selection. 

Brief biological analyses of the selected genes 

The summaries of the 10 top-ranked informative genes 
found by the three sparse logistic regression methods for 
4 gene expression datasets are shown in Tables 7, 8, 9 
and 10 respectively. The genes with star(*) are the most 
frequently selected genes to construct the classifiers 
according to the last column of Table 6, and the com- 
mon genes obtained by each classifier are emphasized 
with bold. The biologically experimental results proved 



some genes included in the frequently selected gene sets 
that produce high classification accuracy rate are mostly 
and functionally related to carcinogenesis or tumor 
histogenesis. For example, in Table 7, the most fre- 
quently selected gene set of each sparse logistic method 
for leukemia classification, including cystatin C (CST3) 
and myeloperoxidase (MPO) genes, that achieve high 
classification accuracy by the L 1/2 method, are experi- 
mentally proved to be correlated to leukemia of ALL or 
AML. The cystatin C gene is located at the extracellular 
region of the cell and has role in invasiveness of human 
glioblastoma cells. Decrease of cystatin C in the CSF 
might contribute to the process of metastasis and spread 
of the cancer cells in the leptomeningeal tissues [23]. 
The myeloperoxidase gene is taking role in anti- 
apoptosis process where cancer cells kill themselves [24]. 
For the colon dataset (Table 9), the most frequently se- 
lected gene set of each sparse logistic method includes 
genes such as guanylate cyclase activator 2B (GUCA2B), 
myosin, light chain 6, alkali, smooth muscle and non- 
muscle (MYL6) and Human desmin (DES) genes. These 
genes are the top 3 significant informative genes ranked 
by our proposed L 1/2 method and also selected by Ben- 
Dor et al. [25], Yang and Song [26] and Li et al. [27]. On 
the top of these genes lists is guanylate cyclase activator 




Liang et al. BMC Bioinformatics 2013, 14:198 
http://www.biomedcentral.com/1471 -21 05/1 4/1 98 



Page 9 of 12 




2B (GUCA2B) gene. Notterman et al [28] showed that a 
reduction of uroguanylin might be an indication of colon 
tumors, and Shailubhai et al [29] reported that treat- 
ment with uroguanylin has a positive therapeutic signifi- 
cance to the reduction in pre-cancerous colon ploys. 

In Tables 7, 8, 9 and 10, some genes are only fre- 
quently selected by the Ll/2 method, but not discovered 
by the L EN and L 2 methods. The evidence from the liter- 
atures showed that they are cancer related genes. For ex- 
ample, for the colon dataset, the genes cholinergic 
receptor, nicotinic, delta polypeptide (CHRND) and 
platelet/endothelial cell adhesion molecule- 1 (PEC AMI) 
were also selected by Maglietta R. et al. [30], Wiese A.H. 



et al. [31], Wang S. L. et al. [32], and Dai J. H. and Xu Q. 
[33]. These genes can significantly discriminate between non- 
dissected tumors and micro dissected invasive tumor cells. It 
is remarkable that apparently (to our knowledge) some dis- 
covered genes that have not been seen in any past studies. 

On the other hand, from Tables 7, 8, 9 and 10, we 
found that the most frequently selected genes and their 
ranking orders by the LEN and LI methods are much 
similar compared with those of the Ll/2 method. The 
main reasons are that the classification hypothesis needs 
not be unique as the samples in gene expression data lie 
in a high-dimensional space, and both of the LEN and 
LI methods are based on the LI type penalty. 



Table 7 The 10 top-ranked informative genes found by the three sparse logistic regression methods from the 
Leukaemia dataset 



Rank 


Gene description 








Ll/2 


Len 


Li 


1 


CST3 cystatin C * 


CFD complement factor D (adipsin) * 


CST3 cystatin C * 


2 


MPO myeloperoxidase * 


CST3 cystatin C * 


CFD complement factor D (adipsin) * 


3 


IL8 interleukin 8 


MPO myeloperoxidase * 


MPO myeloperoxidase * 


4 


GYPB glycophorin B (MNS blood group) 


DNTT deoxynucleotidyltransferase, 
terminal * 


IL8 interleukin 8 ' 


5 


IGL immunoglobulin lambda locus 


TCL1AT-cell leukemia/lymphoma 1A * 


DNTT deoxynucleotidyltransferase, 
terminal * 


6 


DNTT deoxynucleotidyltransferase, 
terminal 


IGL immunoglobulin lambda locus * 


TCL1AT-cell leukemia/lymphoma 1A* 


7 


LOC1 00437488 interleukin-8-like 


IL8 interleukin 8 ' 


IGL immunoglobulin lambda locus 


8 


LTB lymphotoxin beta (TNF superfamily, 
member 3) 


ZYX zyxin * 


LTB lymphotoxin beta (TNF superfamily, 
member 3) 


9 


TCRB T cell receptor beta cluster 


LTB lymphotoxin beta (TNF superfamily, 
member 3) * 


CD79A CD79a molecule, immunoglobulin- 
associated alpha 


10 


S100A9 S100 calcium binding protein A9 


CD79A CD79a molecule, immunoglobulin- 
associated alpha 


HBB hemoglobin, beta 



The genes with star(*) are the most frequently selected genes to construct the classifiers according to the last column of Table 6, and the common genes 
obtained by L 1/2 , Len , U classifiers are emphasized with bold. 
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Table 8 The 10 top-ranked informative genes found by the three sparse logistic regression methods from the Prostate 
dataset 



Rank 


Gene description 








u 1/2 


■-EN 




1 


SLC43A3 solute carrier family 
43, member 3 * 


AMOTL2 angiomotin like 2 * 


USP4 ubiquitin specific peptidase 
4 (proto-oncogene) * 


2 


CD22 CD22 molecule * 


USP4 ubiquitin specific peptidase 4 (proto- 
oncogene) * 


CD22 CD22 molecule * 


3 


KHDRBS1 KH domain containing, RNA binding, 
signal transduction associated 1 * 


EIF4EBP2 eukaryotic translation initiation 
factor 4E binding protein 2 * 


EIF4EBP2 eukaryotic translation 
initiation factor 4E binding protein 2 * 


4 


ZNF787 zinc finger protein 787 * 


PRAF2 PRA1 domain family, member 2 * 


Gene symbol:AA683055, probe set: 
3471 1_at* 


5 


GMPR guanosine monophosphate reductase * 


CACYBP calcyclin binding protein * 


AMOTL2 angiomotin like 2 * 


6 


AMOTL2 angiomotin like 2 


Gene symbol:AA683055, probe set: 3471 1_at * 


VSNL1 visinin-like 1 * 


7 


EIF4EBP2 eukaryotic translation initiation 
factor 4E binding protein 2 


VSNL1 visinin-like 1 * 


FLNC filamin C, gamma * 


8 


USP2 ubiquitin specific peptidase 2 


SLC43A3 solute carrier family 43, member 3 


PRAF2 PRA1 domain family, member 2 * 


9 


USP4 ubiquitin specific peptidase 
4 (proto-oncogene) 


CD22 CD22 molecule * 


CACYBP calcyclin binding protein * 


10 


ACTN4 actinin, alpha 4 


TMC01 transmembrane and 
coiled-coil domains 1 * 


SLC43A3 solute carrier 
family 43, member 3 * 



The genes with star(*) are the most frequently selected genes to construct the classifiers according to the last column of Table 6, and the common genes 
obtained by L 1/2 , Len / L-i classifiers are emphasized with bold. 



Construct KNN classifier with the most frequently 
selected relevant genes 

In this section, to further evaluate the performance and 
prediction generality of the sparse logistic regression 
with L 1/2 penalty, we constructed KNN (k =3, 5) 



classifiers using the relevant genes which were most fre- 
quently selected by the L 1/2 penalized logistic regression 
method. In this experiment, we use the random leave- 
one-out cross validation (LOOCV) to evaluate the pre- 
dictive ability and repeat 50 runs. 



Table 9 The 10 top-ranked informative genes found by the three sparse logistic regression methods from the colon 
dataset 



Rank 


Gene description 








Ll/2 


Len 


Li 


1 


GUCA2B guanylate cyclase activator 
2B (uroguanylin) * 


GUCA2B guanylate cyclase 
activator 2B (uroguanylin) * 


GUCA2B guanylate cyclase 
activator 2B (uroguanylin) * 


2 


MYL6 myosin, light chain 6, alkali, 
smooth muscle and non-muscle * 


MYH9 myosin, heavy chain 9, non-muscle * 


ATPsyn-Cf6 ATP synthase-coupling 
factor 6, mitochondrial * 


3 


DES desmin * 


DES desmin * 


MYH9 myosin, heavy chain 9, non-muscle * 


4 


CHRND cholinergic receptor, nicotinic, 
delta polypeptide * 


MYL6 myosin, light chain 6, alkali, 
smooth muscle and non-muscle * 


GSN gelsolin * 


5 


PECAM1 platelet/endothelial cell 
adhesion molecule-1 * 


GSN gelsolin * 


MYL6 myosin, light chain 6, alkali, 
smooth muscle and non-muscle * 


6 


ATPsyn-Cf6 ATP synthase-coupling 
factor 6, mitochondrial 


COL1 1 A2 collagen, type XI, alpha 2 * 


COL1 1 A2 collagen, type XI, alpha 2 * 


7 


ATF7 activating transcription factor 7 


ATPsyn-Cf6 ATP synthase-coupling 
factor 6, mitochondrial * 


MXI1 MAX interactor 1, dimerization protein * 


8 


PROBABLE NUCLEAR ANTIGEN (Pseudorabies 
virus)[accession number:T86444] 


ssb single-strand binding protein * 


UQCRC1 ubiquinol-cytochrome c 
reductase core protein I * 


9 


MYH9 myosin, heavy chain 9, non-muscle 


Sept2 septin 2 * 


DES desmin * 


10 


MYH10 myosin, heavy chain 10, non-muscle 


MXI1 MAX interactor 1, dimerization protein * 


ZEB1 zinc finger E-box binding homeobox 1* 



The genes with star(*) are the most frequently selected genes to construct the classifiers according to the last column of Table 6, and the common genes 
obtained by L 1/2 , L EN , Lt classifiers are emphasized with bold. 
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Table 10 The 10 top-ranked informative genes found by the three sparse logistic regression methods from the DLBCL 
dataset 



Rank 


Gene description 








Ll/2 


l-EN 


Li 


1 


CCL21 chemokine (C-C motif) ligand 21 * 


MTH1 metallothionein 1H * 


MTH1 metallothionein 1H * 


2 


HLA-DQB1 major histocompatibility complex, class II, 
DQ beta 1 * 


MT2A metallothionein 2A * 


MT2A metallothionein 2A * 


3 


MT2A metallothionein 2A * 


SFTPA1 surfacant protein Al * 


CCL21 chemokine (C-C motif) ligand 

2 * 


4 


THRSP thyroid hormone responsive * 


TCL1AT-cell leukemia/lymphoma 1A * 


SFTPA1 surfacant protein Al * 


5 


Igj immunoglobulin joining chain * 


ZFP36L2 ZFP36 ring finger protein-like 2 * 


POLD2 polymerase (DNA directed), 
delta 2, accessory subunit * 


6 


TCL1AT-cell leukemia/lymphoma 1A * 


FCGR1 A Fc fragment of IgG, high affinity 
la, receptor (CD64) * 


Igj immunoglobulin joining chain * 


7 


GOT2 glutamic-oxaloacetic transaminase 2, 
mitochondrial (aspartate aminotransferase 2) * 


Igj immunoglobulin joining chain * 


MELK maternal embryonic leucine 
zipper kinase * 


8 


Plod procollagen lysyl hydroxylase * 


TRB2 Homeodomain-like/winged-helix 
DNA-binding family protein * 


CKS2 CDC28 protein kinase regulatory 
subunit 2 * 


9 


STXBP2 syntaxin binding protein 2 * 


MELK maternal embryonic leucine zipper 
kinase * 


EIF2A eukaryotic translation initiation 
factor 2A, 65kDa * 


10 


SFTPA1 surfacant protein Al * 


CCL21 chemokine (C-C motif) ligand 2 * 


AQP4 aquaporin 4 * 



The genes with star(*) are the most frequently selected genes to construct the classifiers according to the last column of Table 6, and the common genes 
obtained by L 1/2 , L EN , Lt classifiers are emphasized with bold. 



Table 11 summarizes classification accuracies of four 
datasets with KNN classifiers with selected genes by 
our proposed methods. From Table 11, we can see 
that all the classification accuracies are high than 90%, 
especially the classification accuracy on the Leukaemia 
dataset is 98.3%. The KNN classifiers with relevant 
genes which were selected by the sparse logistic re- 
gression with the L 1/2 penalty can achieve high classifi- 
cation accuracy. The results indicate that the sparse 
logistic regression with the L 1/2 penalty can select 
power discrimination genes. 

Conclusions 

In cancer classification application based on micro- 
array data, only a small subset of genes is strongly 
indicative of a targeted disease. Thus, feature selec- 
tion methods play an important role in cancer classi- 
fication. In this paper, we propose and model sparse 



Table 11 Summary of the results of KNN classifiers using 
the most frequently selected genes by our proposed L n/2 
penalized logistic regression method 


Methods 


K-NN(k=3) 


K-NN(k=5) 


Leukaemia 


98.3% 


94.4% 


Prostate 


95.1% 


94.2% 


Colon 


95.1% 


90.6% 


DLBCL 


94.8% 


91.2% 



logistic regression with the L 1/2 penalty, and develop 
the corresponding coordinate descent algorithm as a 
novel gene selection approach. The proposed method 
utilizes a novel univariate half thresholding to update 
the estimated coefficients. 

Both simulation and microarray data studies show that 
the sparse logistic regression with the L 1/2 penalty 
achieve higher classification accuracy than those of or- 
dinary Li and elastic net regularization approaches, 
while fewer but informative genes are selected. There- 
fore, the sparse logistic regression with the L 1/2 penalty 
is the effective technique for gene selection in real classi- 
fication problem. 

In this paper, we use the proposed method to solve 
binary cancer classification problem. However, many 
cancer classification problems involve multi- category 
microarray data. We plan to extend our proposed 
method to solve multinomial penalized logistic re- 
gression for multiclass cancer classification in our 
future work. 
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